Abstract



Repeated interaction between individuals is the main mechanism for maintaining cooperation
in social dilemma situations. Variants of tit-for-tat (repeating the previous action of the
opponent) and the win-stay lose-shift strategy are known as strong competitors in iterated social
dilemma games. On the other hand, real repeated interaction generally allows plasticity (i.e.,
learning) of individuals based on the experience of the past. Although plasticity is relevant to
various biological phenomena, its role in repeated social dilemma games is relatively unexplored.
In particular, if experience-based learning plays a key role in promotion and maintenance of
cooperation, learners should evolve in the contest with nonlearners under selection pressure.
By modeling players using a simple reinforcement learning model, we numerically show that
learning enables the evolution of cooperation. We also show that numerically estimated adaptive
dynamics appositely predict the outcome of evolutionary simulations. The analysis of the
adaptive dynamics enables us to capture the obtained results as an affirmative example of the
Baldwin effect, where learning accelerates the evolution to optimality.



1 Introduction 



The mechanisms of cooperation in social dilemma situations are a central topic in
interdisciplinary research fields including evolutionary biology, ecology, economics, and sociology.
As analyzed by the prisoner's dilemma (PD) game and its relatives, direct reciprocity is among the
main known mechanisms underlying cooperative behavior (Trivers, 1971; Axelrod, 1984). In
direct reciprocity, iterated interaction between the same individuals motivates them to continue
cooperating (C) rather than defecting (D) to obtain momentarily large payoffs; defection
would be negatively rewarded by the opponent player's retaliation in later rounds. Variants of
the celebrated retaliatory strategy tit-for-tat (mimicking the opponent's action in the previous
round) (Nowak & Sigmund, 1992) and a win-stay lose-shift strategy (Kraines & Kraines, 1989;
Nowak & Sigmund, 1993) are recognized as strong competitors in the iterated PD game.



In the iterated games concerning direct reciprocity, it is natural to assume that players
modify their strategies in response to their experiences in past rounds. Tit-for-tat, its variants,
and win-stay lose-shift strategies can be interpreted as examples of such learning strategies
because tit-for-tat, for example, implies that the player selects the action (i.e., C or D)
depending on the result of the last round. A more sophisticated learning player of this kind
exploits a longer history of the game for action selection (e.g., cooperate if the player and the
opponent cooperated in the previous two rounds, and defect otherwise) (Lindgren, 1991). Classes
of other learning models include fictitious play and reinforcement learning (Camerer, 2003;
Fudenberg & Levine, 1998). Learning seems beneficial in iterated games because
learning players are more flexible than nonlearning players.

If learning is a key factor in promoting cooperation in real societies, the number of
learning players should increase when a population evolves under selection pressure. However,
the advantage of learners over nonlearners in evolutionary dynamics is elusive because a pair
of learning players often results in mutual defection (Macy, 1996; Sandholm & Crites, 1996;
Posch et al., 1999; Taiji & Ikegami, 1999; Macy & Flache, 2002; Masuda & Ohtsuki, 2009) and
learning may be costly.

The constructive roles of learning in the evolution of certain traits are collectively called the
Baldwin effect (see Simpson, 1953; Turney et al., 1996; Weber & Depew, 2003; Crispo,
2007; Badyaev, 2009 for reviews). Although earlier examples of the Baldwin effect are not
necessarily founded on firm empirical evidence (Simpson, 1953; Weber & Depew, 2003), there exists
a plethora of positive evidence of the Baldwin effect. Examples include morphological
development in flies (Waddington, 1942), colonization of North America by the house finch
(Badyaev, 2009), and persistence of coastal juncos (Yeh & Price, 2004). The concept of the Baldwin
effect itself differs among authors (see Simpson, 1953; Downes, 2003; Turney et al., 1996). Although
earlier computational models suggest that learning accelerates evolution (Hinton & Nowlan, 1987;
Ancel, 1999; Maynard Smith, 1987), later theoretical and numerical studies suggest that
learning either accelerates or decelerates evolution toward the optimum depending on the details
of the models (Ancel, 2000; Dopazo et al., 2001; Borenstein et al., 2006; Paenke et al., 2007;
Paenke et al., 2009). The advantage of learning in evolution is also nontrivial in this broader
context.

We numerically investigate the effect of learning on evolution in the iterated PD game. This
question was explored in previous literature (Suzuki & Arita, 2004; also see Wang et al., 2008 for
discussion). Our emphasis in this study is to use a reinforcement learning model for the iterated
PD game (Masuda & Nakamura, 2011) that is much simpler in terms of the number of plastic
elements than the plastic look-up-table model adopted in (Suzuki & Arita, 2004). In our model,
players are satisfied with and persist in the current action when the obtained payoff is larger than
a plastic threshold. Our model of players, introduced in (Masuda & Nakamura, 2011), modifies
those in (Karandikar et al., 1998; Posch et al., 1999; Macy & Flache, 2002). Via the stability
analysis for nonlearning players, the numerical analysis of the discretized adaptive dynamics
with nonlearning and learning players, and full evolutionary simulations, we show that learning
is needed for a noncooperative population to evolve to be able to engage in mutual cooperation
for wide parameter ranges. We also discuss our results in the context of the Baldwin effect.

2 Model



2.1 Iterated PD game 

We assume that each player plays the PD game against each of the other players in a population. 
In each round $t$ ($t = 1, 2, \ldots$) within a generation, a player selects C or D without knowing
the action (i.e., C or D) of the opponent player. The payoff to the focal player is defined by

$$
\begin{array}{c|cc}
 & \mathrm{C} & \mathrm{D} \\
\hline
\mathrm{C} & R & S \\
\mathrm{D} & T & P
\end{array}
\qquad (1)
$$

where $T > R > P > S$ and $R > (T + S)/2$. Equation (1) represents the row player's payoff.
The payoff to the opponent (column player) is defined likewise; the PD game is symmetric.
Because $T > R$ and $P > S$, mutual defection is the only Nash equilibrium of the single-shot
PD game.
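
As a quick check of the conditions above, the following minimal sketch encodes the payoff matrix of Eq. (1) and confirms that D is the best response to either action. The numerical values are the ones the paper uses later in its simulations ($R = 4$, $T = 5$, $S = 0$, $P = 2$); the function names are illustrative assumptions, not part of the original model description.

```python
# Payoff values used later in the paper's numerical simulations.
R, T, S, P = 4.0, 5.0, 0.0, 2.0


def is_prisoners_dilemma(R, T, S, P):
    """Return True if T > R > P > S and R > (T + S) / 2 both hold."""
    return T > R > P > S and R > (T + S) / 2


def payoff(own, other):
    """Row player's payoff in Eq. (1) for actions 'C' or 'D'."""
    table = {("C", "C"): R, ("C", "D"): S, ("D", "C"): T, ("D", "D"): P}
    return table[(own, other)]


def best_response(other):
    """Action that maximizes the single-shot payoff against a fixed opponent action."""
    return max("CD", key=lambda a: payoff(a, other))
```

Because `best_response` returns D against both C and D, mutual defection is the unique single-shot Nash equilibrium for these values.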

However, players may continue mutual cooperation for their own benefits in the iterated
PD game (Trivers, 1971; Axelrod, 1984). We denote the number of rounds per generation by
$t_{\max}$. Technically, the Nash equilibrium of the iterated PD game is perpetual mutual defection
if the players know $t_{\max}$ beforehand. The number of rounds is often randomized to avoid this
effect (Axelrod, 1984). To simplify the analysis, we assume that the players are unaware of the
fixed value of $t_{\max}$.

Earlier studies identified tit-for-tat, which involves imitating the previous action of the
opponent, as a strong strategy in the iterated PD game when various strategies coexist in a
population (Trivers, 1971; Axelrod, 1984). However, later studies showed that tit-for-tat is not robust
against error and that alternative strategies such as generous tit-for-tat (Nowak & Sigmund, 1992)
and Pavlov (Kraines & Kraines, 1989; Nowak & Sigmund, 1993) are strong competitors in the
iterated PD game with error. By definition, a Pavlov player receiving payoff $T$ or $R$ is satisfied
and does not change the action in the next round, whereas the same player receiving payoff $P$
or $S$ is dissatisfied and flips the action. A population composed of Pavlov players, for example,
realizes mutual cooperation such that a player gains approximately $R$ per round.






2.2 Reinforcement learning 



Intuitively, the ability to learn may seem to be an advantageous trait in the iterated PD game
if the cost of learning is negligible. However, this is generally not the case. A pair of
learning players often ends up with mutual defection unless a learning algorithm is carefully designed
(Macy, 1996; Sandholm & Crites, 1996; Posch et al., 1999; Taiji & Ikegami, 1999; Macy & Flache, 2002;
Masuda & Ohtsuki, 2009). Learning requires trial and error, i.e., the exploration of unknown
behavioral patterns as well as the exploitation of known advantageous behavioral patterns.
Exploratory behavior of a learning player may look just random to opponents, and it is rational
to defect against random-looking players.

To balance the possibility of mutual cooperation, the simplicity of the learning
algorithm, and the biological plausibility of the model as compared to some other learning
algorithms, we use a variant of the Bush-Mosteller (BM) reinforcement learning model
(Masuda & Nakamura, 2011). This model modifies the models in the previous literature
(Karandikar et al., 1998; Posch et al., 1999; Macy & Flache, 2002) such that players learn to
mutually cooperate for wide parameter ranges.



In round $t$, the cooperability of the learning player is given by the probability $p_t$. We update
$p_t$ using the results of the single-shot PD game as follows:

$$
p_{t+1} =
\begin{cases}
p_t + (1 - p_t) s_t & \text{(action in round $t$ is C, and $s_t \ge 0$)}, \\
p_t + p_t s_t & \text{(action in round $t$ is C, and $s_t < 0$)}, \\
p_t - p_t s_t & \text{(action in round $t$ is D, and $s_t \ge 0$)}, \\
p_t - (1 - p_t) s_t & \text{(action in round $t$ is D, and $s_t < 0$)},
\end{cases}
\qquad (2)
$$

where

$$
s_t = \tanh[\beta (r_t - A_t)], \qquad (3)
$$

and $r_t \in \{R, T, S, P\}$ is the payoff to the player in round $t$. $s_t$ stands for the degree of satisfaction
in round $t$. When $s_t$ is large, the player increases the probability of taking the current action
in round $t + 1$. For example, the third line in Eq. (2) indicates that the player decreases the
probability of cooperation $p_t$ because selecting D in round $t$ has yielded a satisfactory outcome.
In addition, we assume that the player misimplements the action with a small probability $\epsilon$
such that the player in fact cooperates with probability $(1 - 2\epsilon) p_t + \epsilon$ in round $t$. Equations (2)
and (3) indicate that the player is satisfied with the current situation if the obtained payoff $r_t$
is larger than the so-called aspiration level $A_t$. Otherwise, the player is motivated to flip the
action. $\beta$ controls the sensitivity in the plasticity of $p_t$. If $\beta = 0$, $s_t = 0$ for any $t$ such that $p_t$
is constant. If $\beta = \infty$, $s_t = 1$ or $-1$ for any $t$ such that $p_{t+1} = 1$ or $0$.
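
The update rule in Eqs. (2) and (3) can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function names are hypothetical, the case $s_t = 0$ is handled by the first branch of each pair (both branches leave $p_t$ unchanged there), and `action` is taken to be the action actually played in round $t$.

```python
import math


def satisfaction(r, A, beta):
    """Eq. (3): degree of satisfaction given payoff r and aspiration level A."""
    return math.tanh(beta * (r - A))


def update_p(p, action, s):
    """Eq. (2): new probability of cooperation after playing `action` ('C' or 'D')."""
    if action == "C":
        # Satisfied with C: move p toward 1; dissatisfied: move p toward 0.
        return p + (1 - p) * s if s >= 0 else p + p * s
    # Satisfied with D: move p toward 0; dissatisfied: move p toward 1.
    return p - p * s if s >= 0 else p - (1 - p) * s


def cooperation_probability(p, eps):
    """Probability of actually cooperating when actions are misimplemented with rate eps."""
    return (1 - 2 * eps) * p + eps
```

In the limit $s_t = \pm 1$ (i.e., $\beta = \infty$), `update_p` returns exactly 0 or 1, which matches the statement that $p_{t+1}$ is then 1 or 0.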

Unless otherwise stated, we set the initial condition to $p_1 = 0$, i.e., the player defects in
round 1. This value of $p_1$ is the most adverse to mutual cooperation. We will confirm in
Secs. 3.1 and 3.4 that our main results are qualitatively the same if we set $p_1 = 1$.

The dynamics of the aspiration level are given by

$$
A_{t+1} = (1 - h) A_t + h r_t, \qquad (4)
$$

where $h$ represents the learning rate. If $h = 0$, $A_t$ is constant, and the model is equivalent
to the classical BM model. If $h = 1$, the player compares the current payoff and the payoff
obtained in the last round to determine $s_t$. In our previous work, we showed that mutual
cooperation is established among the players after only $t_{\max} = 100$ rounds if $\beta$ is large and $h$ is
small (Masuda & Nakamura, 2011). In the numerical simulations, we set $\beta = 3$, which is large
enough to support mutual cooperation if other conditions, such as small $h$ and small $\epsilon$, are met.
We remark that the initial condition $A_1$ is a key parameter to characterize the player.
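
To make the role of $h$ in Eq. (4) concrete, the sketch below iterates the aspiration update for a given payoff sequence and compares it with the closed form that follows from Eq. (4) when the payoff is constant, $A_{n+1} = r + (A_1 - r)(1 - h)^n$. The function names are illustrative assumptions.

```python
def aspiration_trajectory(A1, h, payoffs):
    """Iterate Eq. (4); returns [A_1, A_2, ...] with one extra entry per payoff."""
    A = [A1]
    for r in payoffs:
        A.append((1 - h) * A[-1] + h * r)
    return A


def closed_form(A1, h, r, n):
    """Aspiration level after n updates when every payoff equals r."""
    return r + (A1 - r) * (1 - h) ** n
```

For example, with $A_1 = -0.5$, $h = 0.1$, and a constant payoff $P = 2$ (mutual defection), the aspiration level approaches 2 geometrically, whereas $h = 0$ keeps it fixed and $h = 1$ makes it equal to the previous payoff.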

2.3 Evolutionary dynamics 

We set the number of players in the population to $N = 500$. In a single generation, each
player $i$ plays the iterated PD game with $t_{\max} = 200$ against all the other players. We always
reset $p_t$ and $A_t$ to $p_1$ and $A_1$ when a player starts the iterated PD game with a new opponent.
The single-generation payoff $\bar{r}_i$ ($\in [S, T]$) is equal to the sum of the payoffs obtained by
playing against the $N - 1$ other players, divided by $(N - 1) t_{\max}$.

After the single-generation payoffs to all the players are determined, we select two players $i$
and $j$ with equal probability for strategy update. We use the Fermi rule (Szabó & Tőke, 1998;
Traulsen et al., 2006) in which player $i$ adopts $j$'s $A_1$ and $h$ values in the next generation with
probability $1 / [1 + \exp(\beta(\bar{r}_i - \bar{r}_j))]$, and player $j$ adopts $i$'s parameter values otherwise. We
set $\beta = 1$ in this update rule, independently of the sensitivity $\beta$ in Eq. (3). To account for
mutation, we assume that after strategy update, $A_1$ and $h$ of the
adopter are displaced by random small values obeying the uniform density on $[-\Delta_{A_1}, \Delta_{A_1}]$
and $[-\Delta_h, \Delta_h]$, respectively. If the displaced $h$ exceeds 1 or is negative, we reset $h$ to 1 or 0,
respectively. However, the resetting seldom occurs in our evolutionary simulations.
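
One strategy-update event as described above can be sketched as follows. This is an illustration under assumptions rather than the authors' code: the population is represented as a list of hypothetical `(A1, h)` tuples, the two players are drawn as a distinct pair, and the default mutation widths 0.05 and 0.01 are the values quoted for the example run in Sec. 3.4.

```python
import math
import random


def fermi_adopt_probability(r_i, r_j, beta=1.0):
    """Probability that player i adopts player j's (A1, h) under the Fermi rule."""
    return 1.0 / (1.0 + math.exp(beta * (r_i - r_j)))


def mutate(A1, h, delta_A1=0.05, delta_h=0.01, rng=random):
    """Displace A1 and h by uniform noise; clip h to the interval [0, 1]."""
    A1_new = A1 + rng.uniform(-delta_A1, delta_A1)
    h_new = h + rng.uniform(-delta_h, delta_h)
    return A1_new, min(1.0, max(0.0, h_new))


def strategy_update(population, payoffs, beta=1.0, rng=random):
    """Apply one update event in place and return the index of the adopter."""
    i, j = rng.sample(range(len(population)), 2)
    if rng.random() < fermi_adopt_probability(payoffs[i], payoffs[j], beta):
        adopter, model = i, j
    else:
        adopter, model = j, i
    # The adopter copies the model's parameters and then mutates them.
    population[adopter] = mutate(*population[model], rng=rng)
    return adopter
```

With equal payoffs the adoption probability is 1/2, and a player with a much smaller payoff adopts the other's parameters almost surely.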

The phenotype of a player in round $t$ is specified by $p_t$ and $A_t$. It should be noted that $p_t$
and $A_t$ are not inherited over generations. In other words, natural selection operates on the
capacity to learn (i.e., $h$) but not on the acquired behavior (i.e., $p_t$ and $A_t$). Because we let $\beta$ in
Eq. (3) be relatively large to realize mutual cooperation (Masuda & Nakamura, 2011), $p_t$ is
sensitive to the excess payoff relative to $A_t$ in the sense that $p_t$ is close to 0 or 1 unless $r_t$ is close
to $A_t$. Therefore, $p_t$ is similar to the probability of cooperation conditioned on the outcome
of the PD game in the previous round. When we use the term learning in the following, we
exclusively refer to that induced by $h$ in the iterated PD game. A positive value of $h$ directly
raises the plasticity of $A_t$ and indirectly controls that of $p_t$.

3 Results 

3.1 Nash equilibria without learning

To show that learning is necessary for the emergence of cooperation, we start by analyzing
the competition between players that do not learn. With learning rate $h$ equal to zero, the
aspiration level is fixed over rounds (i.e., $A_t = A_1$, $t \ge 1$). For the sake of analysis, we set
$\beta = \infty$. Then, Eqs. (2) and (3) imply that the player persists in the current action (i.e.,
$(p_t, p_{t+1}) = (0, 0)$ or $(1, 1)$) if $r_t - A_t > 0$ and flips the action (i.e., $(p_t, p_{t+1}) = (0, 1)$ or $(1, 0)$)
otherwise. When $A_t$ is fixed, there are five strategies:

• Strategy st1 is defined by $A_t < S$. Except for the action misimplementation, an st1 player
always cooperates or always defects, depending on the action in the first round.

• Strategy st2 is defined by $S < A_t < P$. An st2 player does not flip the action unless
$r_t = S$.

• Strategy st3 is defined by $P < A_t < R$. An st3 player does not flip the action if mutual
cooperation or unilateral defection is realized. It is equivalent to Pavlov, which is a strong
competitor in the iterated PD game (Kraines & Kraines, 1989; Nowak & Sigmund, 1993).

• Strategy st4 is defined by $R < A_t < T$. An st4 player flips the action unless $r_t = T$.

• Strategy st5 is defined by $T < A_t$. An st5 player flips the action in every round except
when the player misimplements the action.
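
The five strategies listed above partition the aspiration axis at $S$, $P$, $R$, and $T$. A minimal sketch of this classification is given below; the function name is hypothetical, the default payoffs are the values used in the paper's simulations, and assigning a boundary value (e.g., $A_t = P$) to the higher class is an assumption, because the text defines the classes with strict inequalities only.

```python
def classify(A, R=4.0, T=5.0, S=0.0, P=2.0):
    """Map a fixed aspiration level A to one of the strategy labels st1..st5."""
    if A < S:
        return "st1"
    if A < P:
        return "st2"
    if A < R:
        return "st3"
    if A < T:
        return "st4"
    return "st5"
```

For instance, `classify(3.0)` returns `"st3"`, the Pavlov-like class.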

In Table 1, the average payoff to a nonlearning (i.e., $h = 0$) player (row player) playing
against another nonlearning player (column player) is shown for $0 < \epsilon \ll 1$ and $t_{\max} = \infty$.
For example, st1 playing against st2 obtains $(R + 3P + 2S)/6$ per round on average. The
results shown in Table 1 are a subset of those obtained in (Nowak et al., 1995) (see Appendix
A for details). Table 1 indicates that st3 is a Nash equilibrium when the five strategies are
considered. In particular, st3 playing against another st3 realizes mutual cooperation and
obtains the largest average payoff per round $R$. Therefore, a unanimous population composed
of st3 players represents a eusocial situation. Mutual cooperation is not realized by any other
combination of two players.

Table 1 indicates that st2 is also a Nash equilibrium when $3P > R + 2S$. In addition,
although st4 is not a Nash equilibrium, a homogeneous population composed of st4 players is
resistant to invasion by st3 in evolutionary situations because st4 gains a larger payoff than st3
does when playing against an st4 opponent.

To test the robustness of the results shown in Table 1, we set $\beta = 3$, $\epsilon = 0.02$, and $h = 0$,
and numerically calculate the payoff averaged over $t_{\max}$ rounds to different nonlearning
players with different fixed aspiration levels $A_1$. We also set $R = 4$, $T = 5$, $S = 0$, and
$P = 2$ in this and the following numerical simulations. The average payoff to a nonlearner
playing against another nonlearner is shown for $p_1 = 0$ and $t_{\max} = 200$ in Fig. 1. The presented
values are averages over 100 trials for each pair of $A_1$ values. The results shown in Fig. 1 are
qualitatively the same as those shown in Table 1. We also confirmed that the results hardly
change for $(p_1, t_{\max}) = (0, 2000)$, $(1, 200)$, and $(1, 2000)$.

3.2 Possibility of mutual cooperation via learning 

If $h > 0$, players different from st3 may adjust $A_t$ until $P < A_t < R$ is satisfied such that they
learn to behave as Pavlov. Therefore, learning may play a constructive role in the evolution
of mutual cooperation. In fact, this is not always the case; $\epsilon > 0$ is a necessary condition for
mutual cooperation to evolve.

To explain this point, we set $\beta = 3$ and $h = 0.1$, and numerically examine the behavior
of a pair of players. Typical time courses of the aspiration level for a pair of learning players
over rounds without action misimplementation (i.e., $\epsilon = 0$) are shown in Fig. 2(a). Each of the
three pairs with close $A_1$ values represents a pair of st1 (thick lines), st3 (dotted lines), and st5
players (medium lines), respectively. We used different values of $A_1$ for each pair for the clarity
of the figure; making $A_1$ equal for two players does not qualitatively change the results. The
thick lines in Fig. 2(a) indicate that the two st1 players playing with each other are satisfied
with payoff $P = 2$ obtained by mutual defection. Therefore, their aspiration levels converge to
$A_t = P$. The results would be the same if we start from a pair of st2 players or a combination
of an st1 player and an st2 player. A pair of st3 players begin mutual cooperation from the
second round, and their $A_t$ values converge to $R = 4$ (dotted lines). Mutual cooperation is
also realized if the two players are initially either st4 or st5, although some rounds are required
before the players mutually cooperate (medium lines).

Although two learning players having $A_t < P$ do not end up with mutual cooperation
when $\epsilon = 0$, the action misimplementation (i.e., $\epsilon > 0$) can trigger a shift from mutual
defection to mutual cooperation. Artificially generated time courses in the presence of action
misimplementation are shown in Fig. 2(b) for expository purposes. Until the intended action is
misimplemented ($1 \le t \le 29$ in Fig. 2(b)), two players starting with $A_1 < P$ keep mutual
defection (thick lines). When $A_t$ has sufficiently approached $P$, we assume that one player
misimplements the action ($t = 30$). Then, the $A_t$ values of both players cross $P$ from below within
a couple of rounds such that the players start to behave as Pavlov and mutually cooperate.
The possibility of mutual cooperation through this mechanism is sensitive to the value of $h$.
Two players starting with $A_1 < P$ end up with $A_t > P$ owing to the action misimplementation
when $(P - S)/(T - P) < (1 - h)^2$ (see Appendix B for derivation). When $T = 5$, $S = 0$, and
$P = 2$, this condition yields $0 < h < 1 - \sqrt{2/3} \approx 0.184$ for an arbitrary value of $R$. Then, the
$A_t$ values of the two players converge to $R$.
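
The bound on $h$ quoted above follows from solving $(P - S)/(T - P) < (1 - h)^2$ for $h$ with $0 \le h \le 1$, which gives $h < 1 - \sqrt{(P - S)/(T - P)}$. The short sketch below evaluates this expression; the function name is an illustrative assumption.

```python
import math


def max_learning_rate(T, S, P):
    """Upper bound on h implied by (P - S) / (T - P) < (1 - h) ** 2."""
    return 1.0 - math.sqrt((P - S) / (T - P))


# Payoff values used in the paper: T = 5, S = 0, P = 2.
bound = max_learning_rate(5.0, 0.0, 2.0)
```

For the paper's payoff values the bound is $1 - \sqrt{2/3}$, and it does not depend on $R$.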

The $A_t$ values also converge to $R$ when we start with a pair of st3 players (dotted lines in
Fig. 2(b)) and a pair of st4 or st5 players (medium lines in Fig. 2(b)). This is because mutual
cooperation is stable against action misimplementation; if one player turns into D by action
misimplementation in round $t$, both players defect in round $t + 1$ and cooperate in round $t + 2$,
if the actions are not misimplemented in rounds $t + 1$ and $t + 2$. This event sequence is likely
unless $\epsilon$ is large.
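
The recovery sequence described above can be traced with a deterministic sketch. As a simplifying assumption, both players are treated as Pavlov-like with a fixed aspiration level between $P$ and $R$ and $\beta = \infty$ (persist if the payoff exceeds the aspiration level, flip otherwise), with no further misimplementation after the initial error; the learning players in the text have finite $\beta$ and a slowly moving $A_t$. The names `next_action` and `run` are hypothetical.

```python
# Payoff values used in the paper: R = 4, S = 0, T = 5, P = 2.
PAYOFF = {("C", "C"): 4.0, ("C", "D"): 0.0, ("D", "C"): 5.0, ("D", "D"): 2.0}


def next_action(action, payoff, A):
    """Keep the current action if payoff > A, otherwise flip it."""
    if payoff - A > 0:
        return action
    return "D" if action == "C" else "C"


def run(actions, A, rounds):
    """Return the list of action pairs, starting from `actions`, for `rounds` further rounds."""
    x, y = actions
    history = [(x, y)]
    for _ in range(rounds):
        rx, ry = PAYOFF[(x, y)], PAYOFF[(y, x)]
        x, y = next_action(x, rx, A), next_action(y, ry, A)
        history.append((x, y))
    return history
```

Starting from a round in which one player has erroneously defected, `run(("D", "C"), 3.0, 2)` passes through mutual defection and returns to mutual cooperation two rounds later.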



3.3 Adaptive dynamics 

In the evolutionary numerical simulations that we will describe in Sec. 3.4, we allow the initial
aspiration level $A_1$ and learning rate $h$ to mutate (Sec. 2.3). If the distribution of $A_1$ and that
of $h$ for an evolving population are single peaked and sufficiently localized, we can grasp the
evolutionary dynamics for a population by tracking the dynamics of the population averages
of $A_1$ and $h$, denoted by $\bar{A}_1$ and $\bar{h}$, respectively. In the extreme case in which all the players
share identical values of $A_1$ and $h$, the instantaneous dynamics of $\bar{A}_1$ and $\bar{h}$ are captured by
adaptive dynamics (Metz et al., 1996; Hofbauer & Sigmund, 1998; Hofbauer & Sigmund, 2003;
Doebeli et al., 2004). Adaptive dynamics reveal the possibility for mutants with a slightly
deviated parameter value to invade a homogeneous resident population. In this section, we
numerically examine two-dimensional adaptive dynamics with respect to $\bar{A}_1$ and $\bar{h}$ to foresee
the evolutionary simulations carried out in Sec. 3.4.

In this and the following sections, we set $\beta = 3$, $p_1 = 0$, $\epsilon = 0.02$, and $t_{\max} = 200$ unless
otherwise stated. Consider a homogeneous population of players sharing the parameter values
$A_1 = \bar{A}_1$ and $h = \bar{h}$. A mutant player with aspiration level $A_1'$ and learning rate $h'$ can invade
the population if

$$
\pi[s', s] - \pi[s, s] > 0 \qquad (5)
$$

or

$$
\pi[s', s] - \pi[s, s] = 0 \ \text{ and } \ \pi[s', s'] - \pi[s, s'] > 0, \qquad (6)
$$

where $s = (\bar{A}_1, \bar{h})$ and $s' = (A_1', h')$ are the strategies of the resident and mutant players,
respectively, and $\pi[s_1, s_2]$ represents the average payoff of strategy $s_1$ when playing with strategy
$s_2$. $s$ is ESS if the converse of Eq. (5) or the converse of Eq. (6) is satisfied. If Eq. (5) or (6)
is satisfied, the homogeneous population comprising strategy $s$ would evolve toward $s'$. We
numerically calculate $\pi[s', s] - \pi[s, s]$, where $s' = (\bar{A}_1 + 0.2, \bar{h})$, $(\bar{A}_1 - 0.2, \bar{h})$, $(\bar{A}_1, \bar{h} + 0.02)$,
and $(\bar{A}_1, \bar{h} - 0.02)$. We confine $s'$ in the neighborhood of $s$ because the amount of mutation
for $A_1$ and $h$ is assumed to be small. Examining $\pi[s', s] - \pi[s, s]$ corresponds to looking at the
discretized adaptive dynamics, i.e., the discretized derivative of $\pi[s', s]$ with respect to $s'$ at
$s' = s$.
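
The invasion conditions in Eqs. (5) and (6) and the four-neighbor evaluation described above can be sketched as follows. The payoff function `pi` is passed in as a callable `pi(s1, s2)` returning the average payoff of strategy `s1` against `s2`; in the paper this quantity is estimated by simulating the iterated PD game, which is not reproduced here. The function names and the tolerance argument are illustrative assumptions.

```python
def invasion_fitness(pi, s_mut, s_res):
    """Left-hand side of Eq. (5): pi[s', s] - pi[s, s]."""
    return pi(s_mut, s_res) - pi(s_res, s_res)


def can_invade(pi, s_mut, s_res, tol=0.0):
    """Return True if Eq. (5) or Eq. (6) holds for mutant s_mut and resident s_res."""
    first = invasion_fitness(pi, s_mut, s_res)
    if first > tol:
        return True
    if abs(first) <= tol:
        return pi(s_mut, s_mut) - pi(s_res, s_mut) > tol
    return False


def discretized_gradient(pi, s, dA=0.2, dh=0.02):
    """Invasion fitness of the four neighboring mutants of s = (A1, h)."""
    A1, h = s
    mutants = {
        "A1+": (A1 + dA, h),
        "A1-": (A1 - dA, h),
        "h+": (A1, h + dh),
        "h-": (A1, h - dh),
    }
    return {name: invasion_fitness(pi, m, s) for name, m in mutants.items()}
```

As a usage illustration with a purely hypothetical payoff function, `pi = lambda s1, s2: -(s1[0] - 3.0) ** 2 - s1[1]` favors mutants with $A_1$ closer to 3 and smaller $h$, so `discretized_gradient(pi, (2.0, 0.1))` is positive for `"A1+"` and `"h-"` and negative for the other two directions.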

For various values of $\bar{A}_1$ and $\bar{h}$, $\pi[s', s] - \pi[s, s]$ is shown in Fig. 3. The plotted values
are averages over $10^4$ runs for any $s$. In Fig. 3(a), $s' = (\bar{A}_1 + 0.2, \bar{h})$ obtains a larger payoff
than $s = (\bar{A}_1, \bar{h})$ in the red region. In this region, $s'$ would invade a homogeneous resident
population of $s$ such that $\bar{A}_1$ increases. In contrast, $s'$ obtains a smaller payoff than $s$ does in
the blue region. Figure 3(a) indicates that, if learning is prohibited (i.e., $\bar{h} = 0$), the population
starting from $\bar{A}_1 = 0$, for example, is expected to evolve such that $\bar{A}_1$ increases, but only up to
$\bar{A}_1 \approx P = 2$. Therefore, a population does not evolve from st2 to st3 without learning.

Figure 3(b), which reveals the possibility of invasion by mutant $s' = (\bar{A}_1 - 0.2, \bar{h})$ in the
resident population of $s$, is a sign-flipped version of Fig. 3(a) in most parameter regions.
Nevertheless, neither the mutants with $A_1' = \bar{A}_1 + 0.2$ nor the ones with $A_1' = \bar{A}_1 - 0.2$ invade
the resident population (i.e., parameter regions colored in blue in both Figs. 3(a) and 3(b)) for
$(\bar{A}_1, \bar{h}) \approx (2, 0)$ and along a bent line passing through $(\bar{A}_1, \bar{h}) \approx (4, 0)$ and $(\bar{A}_1, \bar{h}) \approx (4.3, 0.2)$.
These regions constitute singular points of the adaptive dynamics and serve as repellers. In
other words, $\bar{A}_1$ does not pass through $\approx P = 2$ for $\bar{h} = 0$ and $\approx R = 4$ for various values of $\bar{h}$
in adaptive dynamics. The observations for $\bar{h} = 0$ that the homogeneous population of st2 is
not invaded by st3 mutants, that of st3 is not invaded by st2 or st4 mutants, and that of st4 is
not invaded by st3 mutants, are consistent with the results obtained in Sec. 3.1.

The possibility of invasion by mutant $s' = (\bar{A}_1, \bar{h} + 0.02)$ in the homogeneous population of
$s = (\bar{A}_1, \bar{h})$ is shown in Fig. 3(c). The figure suggests that $h$ would increase for a population
of st1 players (i.e., $A_1 < S = 0$). Learning is preferred to nonlearning when $A_1 < 0$ for the
following reason. As shown in Sec. 3.2, when $h > 0$ and $\epsilon > 0$, $A_t$ increases until the players
behave as Pavlov to mutually cooperate within a relatively small number of rounds (Fig. 2(b)).
In contrast, the players do not establish mutual cooperation when $h = 0$ or $\epsilon = 0$, as shown in
Sec. 3.1 (Fig. 2(a)). Figure 3(c) indicates that $h$ increases up to $h \approx 0.15$. This value of $h$ is
consistent with the upper bound of $h$ for mutual cooperation to be possible, which was derived
in Sec. 3.2. Based on these results, $h$ is expected to initially increase in evolutionary dynamics
starting with a population of nonlearning st1 players. We refer to the stage of evolutionary
dynamics in which $h$ increases as stage 1. The existence of stage 1 is also supported by Fig. 3(d),
in which the mutant has $s' = (\bar{A}_1, \bar{h} - 0.02)$.

After $h$ has increased, Figs. 3(a) and 3(b) imply that $A_1$ increases to cross $P = 2$. When
$h > 0$, a larger value of $A_1$ ($< P$) is beneficial because fewer rounds are required for such
players to turn to Pavlov (i.e., $A_t > P$). Once $A_1$ exceeds $P$ for a majority of players, they
earn a large average payoff $\approx R$ through mutual cooperation. We refer to the transition for
learning players from a small $A_1$ corresponding to st1 or st2 to a large $A_1$ corresponding to st3
as stage 2. Figures 3(a) and 3(b) indicate that the difference between $\pi[s', s]$ and $\pi[s, s]$ when
$\bar{A}_1, \bar{A}_1 + 0.2 < P$ and $\bar{h} > 0$ is small, presumably because $s$ and $s'$ are only slightly different in
terms of the number of transient rounds before the entrance to $A_t > P$. Therefore, we expect
that stage 2 occurs slowly in evolutionary dynamics.

Although it is a minor phenomenon as compared to stages 1 and 2, a smaller $h$ is more
beneficial on the boundary between st2 and st3 (i.e., $A_t \approx P = 2$), as shown in Figs. 3(c) and
3(d). For expository purposes, time courses of the iterated PD game between an st2 player and
an st3 player are shown in Fig. 3(e). As shown by the solid lines, the initial st3 player flips
to st2 before establishing mutual cooperation if $h > 0$. In fact, a nonlearning st3 player (i.e.,
$h = 0$) realizes mutual cooperation with a learning st2 player in earlier rounds (dotted lines)
than a learning st3 player does (solid lines). Therefore, in evolutionary dynamics, $h$ in the
vicinity of $A_t \approx P$ is expected to decrease. We refer to this transition as stage 3. It should be
noted that stage 3 occurs in a narrow range of $\bar{A}_1$ (i.e., $\bar{A}_1 < P$ and $\bar{A}_1 + 0.2 > P$ in Fig. 3(a)).

Through stages 1, 2, and 3, evolution from a defective population of nonlearning st1 players
to a cooperative population of st3 players is logically possible. In contrast, the emergence of
mutual cooperation is hampered if learning is prohibited.

After stage 3, $A_1$ would not evolve beyond $R$; $A_1 \approx R$ is a line of repellers in adaptive
dynamics, as already explained in Figs. 3(a) and 3(b). When $P < A_1 < R$ (i.e., $2 < A_1 < 4$),
the mutant's payoff is indistinguishable from the resident's payoff unless $h$ is large (Figs. 3(a)-
(d)). Therefore, $A_1$ and $h$ would perform approximately unbiased diffusion. This implies that
$h$ that has decreased via stage 3 may increase again.

When at least one of the two players is st4 or st5, a player with a larger $A_1$ is more
advantageous than the opponent with a smaller $A_1$. This is because the former exploits the
latter in early rounds. Nevertheless, these players do not obtain the average payoff as large as
that for a pair of st3 players, which would start to mutually cooperate from the second round.
Therefore, st3 is stable against invasion by st4 and vice versa.

We predict that the learning rate would not eventually decrease to a small value in
evolutionary simulations. In other words, the disadvantage of learning is too small to be
evolutionarily relevant unless the cost of learning is explicitly incorporated.

To assess the robustness of the results obtained from the adaptive dynamics, we reproduced
Figs. 3(a)-(d) with $\epsilon = 0.05$ and $\epsilon = 0.1$. The results for $\epsilon = 0.05$ are qualitatively the same
as those for $\epsilon = 0.02$ (results not shown). The results for $\epsilon = 0.1$ are different in some aspects
from those for $\epsilon = 0.02$ (Fig. 4). Most notably, when $\epsilon = 0.1$, st3 is no longer stable against
invasion by st2 even without learning (i.e., $h = 0$). Therefore, mutual cooperation would
not be stable in evolutionary dynamics. In Figs. 5(a) and 5(b), $\pi[s', s] - \pi[s, s]$ is shown for
$(s, s') = ((1.9, 0), (2.1, 0))$ and $(s, s') = ((2.1, 0), (1.9, 0))$, respectively, for a variety of values
of $t_{\max}$ and $\epsilon$. Figure 5(a) indicates that an st3 mutant does not invade the population of st2
residents for all the examined values of $t_{\max}$ and $\epsilon$. Figure 5(b) indicates that a population of
st3 residents is resistant to invasion by st2 mutants when $t_{\max}$ is large and $\epsilon$ ($> 0$) is small.
Nevertheless, st3 is stable for various values of $t_{\max}$ and $\epsilon$. Because stage 2 is hampered when
$\epsilon = 0$, $\epsilon$ must take an intermediate value for the learning-mediated mutual cooperation to
emerge.

3.4 Evolutionary simulations 

The results in Sec. 3.3 predict the presence of a learning-mediated evolutionary route from
a noncooperative population composed of st1 players to a cooperative population composed
of st3 players. In this section, we carry out direct numerical simulations of the evolutionary
dynamics using a population composed of $N = 500$ players. We initially set $h = 0$ and select
$A_1$ for each player independently from the uniform density on $[S - 1, S]$. Therefore, all the
players are initially nonlearning st1. Refer to Sec. 2.3 for details of the numerical setup.

The evolution of $h$, the total amount of plasticity experienced in a generation, defined
by $\sum_{t=1}^{t_{\max}-1} |A_{t+1} - A_t|$, the average payoff $\bar{\pi}$, and the fraction of mutual cooperation for an example run with
$\Delta_{A_1} = 0.05$ and $\Delta_h = 0.01$ are shown in Fig. 6(a). The average learning rate $h$ and the total
amount of plasticity rapidly increase until the $\approx 3.9 \times 10^4$th round. The payoff and the fraction
of mutual cooperation also increase during this period because st1 players learn to behave as
Pavlov when $h > 0$. This period corresponds to stage 1 described in Sec. 3.3. Then, the
fraction of mutual cooperation and the total amount of plasticity gradually increase until the
$\approx 3.5 \times 10^5$th round, corresponding to stage 2. In the $\approx 3.5 \times 10^5$th round, an st3 mutant
emerges in the population mostly composed of st2 players and gains a larger payoff than st2
residents do. Then, st3 players rapidly replace st2 players in the population such that $\bar{\pi}/R$ and
the fraction of mutual cooperation suddenly increase (Fig. 6(a)). This is because stable mutual
cooperation between st3 players emerges in an early round, whereas that between st2 players
emerges after $\approx 2/\epsilon$ rounds. The learning rate decreases almost at the same time, corresponding
to stage 3. The time courses of the fractions of st1, st2, st3, st4, and st5 players corresponding to
the run shown in Fig. 6(a) are shown in Fig. 6(b). For example, the fraction of the st1 player is
defined by the fraction of players having $A_1 < S$ and any value of $h$. Figure 6(b) indicates that
the population initially composed of st1 players evolves to that of st3 players. The trajectory
of $A_1$ and $h$ corresponding to the same run is shown in Fig. 6(c). Figure 6(c) is consistent with
the scenario of the evolution of cooperation described in Sec. 3.3. The population evolves from
no cooperation to mutual cooperation via the three stages involving learning. After stage 3, $A_1$
and $h$ diffuse without a recognizable bias, which is also consistent with the results obtained in
Sec. 3.3 (white regions in Figs. 3(a)-(d)). However, it should be noted that the total amount
of plasticity remains small after stage 3.
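The two bookkeeping quantities used in this paragraph, the type of a player and the total amount of plasticity, can be illustrated with a minimal Python sketch. The function names are hypothetical. The text of this section only spells out that st1 corresponds to $A_1 < S$ and that st3 corresponds to $P < A_1 < R$; the intervals used below for st2, st4, and st5, as well as the handling of values that fall exactly on a boundary, are assumptions made for illustration. Payoff values are those of the Figure 1 caption.

```python
# Payoff values taken from the Figure 1 caption.
S, P, R, T = 0.0, 2.0, 4.0, 5.0

def strategy_type(a1):
    """Label a player by its initial aspiration level A_1.

    Assumed intervals: st1 (A_1 < S), st2 (S <= A_1 < P), st3 (P <= A_1 < R),
    st4 (R <= A_1 < T), st5 (A_1 >= T). Boundary handling is arbitrary.
    """
    if a1 < S:
        return "st1"
    if a1 < P:
        return "st2"
    if a1 < R:
        return "st3"
    if a1 < T:
        return "st4"
    return "st5"

def type_fractions(a1_values):
    """Fraction of players of each type, regardless of their value of h."""
    fractions = {name: 0.0 for name in ("st1", "st2", "st3", "st4", "st5")}
    for a1 in a1_values:
        fractions[strategy_type(a1)] += 1.0 / len(a1_values)
    return fractions

def total_plasticity(aspirations):
    """Sum of |A_{t+1} - A_t| over one generation for one player."""
    return sum(abs(b - a) for a, b in zip(aspirations, aspirations[1:]))
```

For a nonlearner ($h = 0$) the aspiration level never moves, so `total_plasticity` returns zero; it becomes positive only when learning actually changes $A_t$ within a generation.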

To examine the robustness of the results, we carry out five runs of numerical simulations
for each of the different parameter sets; we could not carry out more extensive numerical
simulations because of the computational cost. We measure two quantities in each run. The
first quantity is the number of generations necessary for $h$ to exceed 0.1 for the first time.
We call this number the end of stage 1. The second quantity is the number of generations
necessary for $A_1$ to exceed $P$ for the first time. We call this number the end of stage 2. The
ends of stages 1 and 2 with $\epsilon = 0.02$, $\Delta_{A_1} = 0.05$, and $\Delta_h = 0.01$ are equal to $257 \pm 28$ ($\times 10^2$)
and $3763 \pm 1514$ ($\times 10^2$), respectively, where the mean $\pm$ standard deviation on the basis of
the five runs is indicated. Those with $\epsilon = 0.05$, $\Delta_{A_1} = 0.05$, and $\Delta_h = 0.01$ are equal to
$299 \pm 55$ ($\times 10^2$) and $6951 \pm 4231$ ($\times 10^2$). Those with $\epsilon = 0.01$, $\Delta_{A_1} = 0.05$, and $\Delta_h = 0.01$
are equal to $265 \pm 31$ ($\times 10^2$) and $3780 \pm 1547$ ($\times 10^2$). Those with $\epsilon = 0.02$, $\Delta_{A_1} = 0.02$, and
$\Delta_h = 0.01$ are equal to $239 \pm 18$ ($\times 10^2$) and $9151 \pm 6836$ ($\times 10^2$). For this parameter set, one
out of the five runs did not reach the end of stage 2 within $2 \times 10^6$ generations, such that the
statistics are based on the other four runs. Those with $\epsilon = 0.02$, $\Delta_{A_1} = 0.05$, and $\Delta_h = 0.005$
are equal to $613 \pm 119$ ($\times 10^2$) and $3504 \pm 999$ ($\times 10^2$). Mutual cooperation evolves via learning
(i.e., finite value of the end of stage 2 up to our numerical efforts) in most cases. When $\epsilon = 0.05$,
evolution to mutual cooperation is slower than when $\epsilon = 0.02$. This may be because learning
players having different values of $A_1$ turn into Pavlov (i.e., $A_t > P$) within a small number of
rounds when $\epsilon$ is relatively large. Then, the payoffs to different learning players would differ
relatively little, which weakens the selection pressure.

We perform another robustness test. For the original parameter values $\epsilon = 0.02$, $\Delta_{A_1} = 0.05$,
and $\Delta_h = 0.01$, the trajectory of $A_1$ and $h$ obtained from a single run with $p_1 = 1$ is shown
in Fig. 7. The results are qualitatively the same as those for $p_1 = 0$ (Fig. 6(c)), although
establishment of cooperation takes a considerably larger number of generations when $p_1 = 1$
than when $p_1 = 0$.

3.5 Baldwin effect 

If we assume an explicit cost of learning, the learning rate decreases after mutual cooperation
is reached. An example time course of $A_1$ and $h$ when a linear cost $-ch$ is added to the
single-generation payoff to each player (Suzuki & Arita, 2004; see Ancel, 1999, 2000 for a different
implementation of the explicit learning cost), where $c = 1$, is shown in Fig. 8(a). The final
value of $h$ is smaller than that in the case without the learning cost (Fig. 6(c)). The result
shown in Fig. 8(a) is an example of the standard Baldwin effect in which the learning rate
initially increases and then decreases (Ancel, 2000; Dopazo et al., 2001; Borenstein et al., 2006;
Paenke et al., 2007; Paenke et al., 2009).



We examine the robustness of the observed Baldwin effect against the variation of $c$. An
example time course of $A_1$ and $h$ when $c = 10$ is shown in Fig. 8(b). As compared to when $c = 1$
(Fig. 8(a)), $h$ is smaller throughout the evolution, and $A_1$ increases more slowly. Nevertheless,
the population mostly consists of st3 players in the end. We carried out five runs for various
values of $c$. The largest $h$ value in $10^6$ generations is shown in Fig. 8(c) for each run. The
largest $h$ value decreases with $c$ because learning is costly for a large value of $c$. The final value
of $h$, calculated as the average over the last $10^4$ generations, is shown in Fig. 8(d). The final
value of $h$ is considerably smaller than the largest value (Fig. 8(c)) for each $c$, indicative of
stage 3 of the Baldwin effect. The final value of $A_1$, calculated as the average over the last $10^4$
generations, is plotted against $c$ in Fig. 8(e). If this value is larger than $P = 2$ and smaller
than $R = 4$, we expect that the final population is mostly composed of st3 players and that
the Baldwin effect is operative. Figure 8(e) suggests that the Baldwin effect occurs in the five
runs when $c < 15$. When $c > 15$, stage 1, i.e., the initial increase in $h$, is often so small in
magnitude that stage 2 sometimes does not occur. We conclude that the Baldwin effect
occurs for a wide range of $c$.
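The two ingredients of this robustness test, the linear learning cost and the criterion on the final value of $A_1$, can be written down as a minimal Python sketch. The function names are hypothetical and the snippet is not the simulation code used for Fig. 8; it only restates the definitions given in the text, with $P = 2$ and $R = 4$.

```python
P, R = 2.0, 4.0   # payoff values quoted in the text

def payoff_with_learning_cost(payoff, h, c):
    """Single-generation payoff after adding the linear cost -c*h."""
    return payoff - c * h

def final_population_is_st3(final_a1):
    """Criterion used in the text: P < final A_1 < R."""
    return P < final_a1 < R
```

For example, with $c = 1$ a player with $h = 0.2$ and a raw payoff of 3.0 is assigned 2.8, whereas a nonlearner ($h = 0$) keeps its raw payoff, which is why selection favors small $h$ once mutual cooperation is innate.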



4 Discussion 

We have shown that reinforcement learning promotes the evolution of mutual cooperation in
a population of players involved in the iterated PD game. Cooperation evolves under some
conditions such as $3P > R + 2S$, positive but not too large values of $\epsilon$, and $t_{\max}$ that is not too
small. The present study is motivated by previous investigations of the Baldwin effect. Our
results provide an example of the Baldwin effect in the form of a computational model of social
behavior.

To understand the behavior of our model analytically, writing down the Fokker-Planck
equation for the joint density of $A_1$ and $h$ may be useful. Starting from the singular density at
a small value of $A_1$ and $h = 0$, we may be able to solve the Fokker-Planck equation numerically
to track the evolution of the joint density and find the Baldwin effect. Alternatively, discretizing
$A_1$ and $h$ and then formulating a Markov chain on the discretized states may also be useful.
Nevertheless, we refrained from such analyses because we consider that they would eventually
necessitate some numerical simulations and would not sufficiently advance the understanding of our
numerical results.



The concept of the Baldwin effect is diverse (Simpson, 1953; Downes, 2003; Turney et al., 1996).
However, arguably, the most accepted variant of the Baldwin effect is formulated as a two-stage
mechanism (Simpson, 1953; Godfrey-Smith, 2003; Crispo, 2007; Turney et al., 1996). In stage
1, plasticity increases because plastic individuals are better at finding the optimal behavior
than nonplastic individuals. In stage 2, mutation makes the optimal behavior innate and
decreases the plasticity of individuals. Mutants that play optimally from the outset of their life
without plasticity and resident individuals that acquire the optimal behavior through plasticity
are eventually equally efficient. Nevertheless, because of the cost of learning, the mutants
overwhelm the residents via natural selection. Stage 2 is often called genetic assimilation.

Stage 1 in our model corresponds to stage 1 of the standard Baldwin effect outlined above.
In stage 2 in our model, $A_1$ increases such that the optimal behavior (i.e., mutual cooperation
by turning into st3) becomes innate. Nevertheless, after stage 3 in our model, in which the
learning rate rapidly decreases, the learning rate starts to perform a random walk because the
learning cost is marginal in our model (Fig. 6(c)). Therefore, the behavior of our model in
stages 2, 3, and onward does not qualify as stage 2 of the standard Baldwin effect in which the
learning rate decreases. With a modified model with an explicit learning cost, we showed that
the learning rate decreases after stage 3 (Fig. 8(a)). In this case, our model naturally fits the
framework of the Baldwin effect.

In a previous computational model of the Baldwin effect, learning rates remain large when
the optimal behavior dynamically changes owing to environmental fluctuations (Ancel, 1999).
In our model without an explicit cost of learning, the learning rate remains large for a different
reason. In our model, the optimal parameter set (i.e., $A_1$ and $h$) does not fluctuate after
sufficient generations. Instead, approximate optimality is realized for various parameter sets,
i.e., any $P < A_1 < R$ and $h > 0$. Therefore, the learning rate performs a random walk to
occasionally visit large values (Fig. 6(c)).

Godfrey-Smith points out three alternative reasons why stage 1 cannot be skipped in the
two-stage mechanism of the Baldwin effect (Godfrey-Smith, 2003). First, learning may provide
a breathing space by which a population can survive long enough to transit to stage 2. This
reason is irrelevant to our model because our model is not concerned with the survival of the
population. The population size is fixed in our model such that the population always survives.
Second, the preferred state may be accessible for learners but not for nonlearners. Although
not explicitly stated in Godfrey-Smith (2003), this mechanism seems to be relevant to cases
in which the fitness landscape does not depend on the configuration of the population. In our
case, however, the fitness landscape depends on the fractions of the different types of players
because the payoff to a player is affected by the strategies of the other players. Third, evolution
may change the "social ecology" of the population such that learners are more advantageous
than nonlearners, a phenomenon called niche construction in a broad sense. The social ecology
implies a fitness landscape that depends on the configuration of the population. In our model,
the social ecology evolves via learning of players. This third mechanism seems to be relevant
to our model. Suppose a hypothetical population comprising st1 nonlearners except two st1
learners. For a focal st1 learner, the social ecology is such that there is one st1 learner and
$N - 2$ st1 nonlearners. If $\epsilon > 0$, the focal st1 learner is likely to gain a payoff that is larger
than that of an st1 nonlearner because the focal player learns to mutually cooperate with the other
st1 learner, whereas an st1 nonlearner does not. The focal st1 learner would not overwhelm st1
nonlearners if the other st1 learner were absent from the social ecology.

The main purpose of this study is to provide an evolutionary model of concrete social
behavior in which learning plays a constructive role. We are not the first to achieve this end.
Suzuki and Arita observed the Baldwin effect in the iterated PD game using different learning
models (Suzuki & Arita, 2004). In their model, the learning rate is assumed to be binary, and
the player's strategy is specified by a look-up table that associates the action to take (i.e., C or
D) with the actions of the previous two rounds of the two players. The entries of the look-up
table dynamically change when the plasticity is in operation. They also considered the effects
of meta-learning in which the player adapts how to update each entry of the look-up table.
The main contribution of the present work relative to theirs is to provide a much simpler model
in terms of the number of plastic parameters. In addition, the learning rates and the range
of parameters are continuous in our model, whereas they are mostly binary in their model.
Our model may be applicable to real animals and facilitates a mechanistic understanding of
evolutionary dynamics by the numerically calculated adaptive dynamics. Apart from the fixed
parameters common to all the individuals, our players only have two parameters that are
plastic within a generation, $p_t$ and $A_t$, and two parameters inherited across generations, $A_1$
and $h$. The results obtained from the adaptive dynamics predict those of direct evolutionary
numerical simulations and provide an intuitive reason why learning promotes the emergence of
mutual cooperation. In particular, we showed the necessity of the evolution of learning ability
for cooperation by explicitly comparing the cases with and without learning. The combination
of adaptive dynamics and evolutionary simulations may also be useful for analyzing the Baldwin
effect in different models.

Appendix A: Payoff to nonlearning players

Nowak et al. (1995) analyzed iterated matrix games between a pair of players that select an
action (i.e., C or D) in response to the actions of the two players in the previous round. There
are four combinations of the actions of the two players in the previous round, i.e., (C, C), (C,
D), (D, C), and (D, D). Because a player assigns C or D to each of these possible outcomes in
the previous round, there are 16 strategies $S_i$ ($0 \le i \le 15$). In fact, st1, st2, st3, st4, and st5 in
the present study are equivalent to $S_{12}$, $S_8$, $S_9$, $S_1$, and $S_3$ in Nowak et al. (1995), respectively.
By calculating the steady state of the Markov chain with four states R, T, S, and P, Nowak
et al. (1995) calculated the average payoff to focal player $S_i$ playing against the opponent $S_j$
($0 \le i, j \le 15$) under a small probability of error in action implementation. Their assumption
for the action misimplementation is slightly different from ours. We assumed that $\epsilon$ is the
probability that each player independently misimplements the action, whereas only one of the
two players may misimplement the action in a round in their model. Nevertheless, our model
is equivalent to theirs in the limit $\epsilon \to 0$ if we set $\epsilon' = 2\epsilon(1 - \epsilon)$, where $\epsilon'$ is the probability of
action misimplementation in the sense of Nowak et al. (1995). Therefore, our results shown in
Table 1 are a corollary of their results.
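The correspondence between st1-st5 and the automata of Nowak et al. (1995) can be checked with a short Python sketch. It rests on two assumptions that are not stated in this appendix: (i) a nonlearning player with fixed aspiration level $A$ repeats its previous action when the payoff obtained exceeds $A$ and switches otherwise, and (ii) the index $i$ of $S_i$ is the binary number formed by the responses (1 = cooperate) to the previous outcomes R, S, T, and P, in this order. The function names are hypothetical, and the payoff values are those of the Figure 1 caption.

```python
# Payoff values taken from the Figure 1 caption.
S, P, R, T = 0.0, 2.0, 4.0, 5.0

def automaton_index(aspiration):
    """Index i of the memory-one strategy S_i played by a nonlearner.

    Assumed encoding: bits (u_R, u_S, u_T, u_P), most significant first,
    where u = 1 means "cooperate after that outcome".
    """
    index = 0
    # A player that obtained R or S had cooperated; T or P means it had defected.
    for payoff, cooperated in ((R, True), (S, True), (T, False), (P, False)):
        satisfied = payoff > aspiration
        cooperate_next = cooperated if satisfied else not cooperated
        index = 2 * index + int(cooperate_next)
    return index

def effective_error(eps):
    """Probability that exactly one of two independent players errs."""
    return 2.0 * eps * (1.0 - eps)

# One representative aspiration level per type st1, ..., st5.
indices = [automaton_index(a) for a in (-0.5, 1.0, 3.0, 4.5, 6.0)]
```

Under these assumptions `indices` evaluates to `[12, 8, 9, 1, 3]`, in agreement with the correspondence stated above, and `effective_error` gives the value of $\epsilon'$ to be used when comparing with the results of Nowak et al. (1995).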



Appendix B: Upper bound of h for st2 players to turn into Pavlov

Given $\beta = \infty$, $0 < \epsilon \ll 1$, $h > 0$, and $A_1 < P$, $A_t$ of the two players, denoted by X and Y,
are sufficiently close to $P$ when one player, which we assume to be Y without loss of generality,
misimplements the action to select C for the first time in round $t \propto 2/\epsilon$. Without any further
action misimplementation, X keeps D and Y flips to D in round $t + 1$ because $A^{(Y)}_{t+1} < P < A^{(X)}_{t+1}$.
In round $t + 2$, X flips to C and Y keeps D. Therefore, we obtain $A^{(X)}_{t+3} = hS + (1 - h)A^{(X)}_{t+2}$,
$A^{(X)}_{t+2} = hP + (1 - h)A^{(X)}_{t+1}$, $A^{(X)}_{t+1} = hT + (1 - h)A^{(X)}_{t}$, and $A^{(X)}_{t} \approx P$. Combining the four
equations, we obtain

$$A^{(X)}_{t+3} = (T - P)h^3 - 2(T - P)h^2 + (T + S - 2P)h + P. \tag{7}$$



Using $T > P > S$ and $0 < h < 1$, we obtain the condition for X to become Pavlov in round
$t + 3$ as $A^{(X)}_{t+3} > P$, i.e.,

$$h < 1 - \sqrt{\frac{P - S}{T - P}}. \tag{8}$$

The condition for Y to become Pavlov in round $t + 3$ is given by $A^{(Y)}_{t+3} > P$, i.e.,

$$h > 1 - \sqrt{\frac{T - P}{P - S}}. \tag{9}$$

Equation (8) implies Eq. (9) because $(P - S)/(T - P) > 0$ and $h > 0$. Therefore, the two
players become Pavlov in round $t + 3$ if Eq. (8) holds true.

We assume that Eq. (8) is violated. If Eq. (9) is also violated, we obtain $A^{(X)}_{t'}, A^{(Y)}_{t'} < P$
($t' \ge t + 3$) such that the two players mutually defect until the occurrence of another action
misimplementation. If Eq. (9) is satisfied, the two players mutually defect in round $t + 3$.
Because $r^{(Y)}_{t+2} = r^{(X)}_{t} = T$, $r^{(Y)}_{t+3} = r^{(X)}_{t+1} = P$, and $A^{(Y)}_{t+2} < P$, we obtain $P < A^{(Y)}_{t+4} < A^{(X)}_{t+2}$.
Because $r^{(X)}_{t+2} = r^{(Y)}_{t} = S$, $r^{(X)}_{t+3} = r^{(Y)}_{t+1} = P$, and $A^{(X)}_{t+2} > P$, we obtain $A^{(Y)}_{t+2} < A^{(X)}_{t+4} < P$. These
two inequalities indicate that X and Y behave as st3 and st2 in round $t + 5$, respectively. By
repeating the same procedure with X and Y swapped, we obtain

$$A^{(Y)}_{t+2} < A^{(Y)}_{t+6} < P < A^{(X)}_{t+6} < A^{(X)}_{t+2}. \tag{10}$$

Therefore, we obtain

$$A^{(Y)}_{t+2} < A^{(Y)}_{t+2+4i} < P < A^{(X)}_{t+2+4i} < A^{(X)}_{t+2} \quad (i \ge 1) \tag{11}$$

by induction. Equation (11) implies that the two players do not realize mutual cooperation if
Eq. (8) is violated.

Therefore, an upper bound of $h$ for a pair of st2 players to turn into Pavlov is given by
solving Eq. (8) with equality.
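As a numerical sanity check of this appendix, the following Python sketch iterates the aspiration update $A_{t+1} = hr_t + (1 - h)A_t$ for player X over the payoff sequence T, P, S, starting from $A_t = P$, and compares the result with the cubic polynomial of Eq. (7). It also evaluates the bound $h < 1 - \sqrt{(P - S)/(T - P)}$ obtained by requiring that the cubic exceed $P$. The function names are hypothetical, the payoff values are those of the Figure 1 caption, and the approximation $A_t \approx P$ is replaced by equality.

```python
import math

# Payoff values taken from the Figure 1 caption.
S, P, R, T = 0.0, 2.0, 4.0, 5.0

def aspiration_after(h, payoffs, a_start=P):
    """Iterate A <- h*r + (1 - h)*A over a sequence of payoffs r."""
    a = a_start
    for r in payoffs:
        a = h * r + (1.0 - h) * a
    return a

def cubic_eq7(h):
    """Right-hand side of Eq. (7)."""
    return (T - P) * h**3 - 2.0 * (T - P) * h**2 + (T + S - 2.0 * P) * h + P

# Bound on h obtained from A_{t+3} > P for player X.
h_upper = 1.0 - math.sqrt((P - S) / (T - P))

for h in (0.05, 0.15, 0.25):
    a_x = aspiration_after(h, (T, P, S))   # X receives T, then P, then S
    print(h, a_x, cubic_eq7(h), a_x > P)
```

With these payoff values `h_upper` is approximately 0.18, so $h = 0.05$ and $h = 0.15$ lead to $A^{(X)}_{t+3} > P$, whereas $h = 0.25$ does not.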

Figure 1: Average payoff to a nonlearning player (row player) playing against an opponent
nonlearning player (column player). We set $\beta = 3$, $p_1 = 0$, $\epsilon = 0.02$, $h = 0$, $t_{\max} = 200$,
$R = 4$, $T = 5$, $S = 0$, and $P = 2$.

Figure 2: Behavior of a pair of learning players. We set $\beta = 3$, $p_1 = 0$, and $h = 0.1$. (a)
Example time courses of the aspiration level for a pair of players when $\epsilon = 0$. The horizontal
lines represent $A_t = P = 2$ and $A_t = R = 4$. We set $A_1$ for the two players to $-1$ and $-0.5$
(thick lines), 2.5 and 3 (dotted lines), and 5.5 and 6 (medium lines). (b) Example time courses
of the aspiration level when $\epsilon = 0.02$. We set $A_1$ as in (a). For each pair, one of the two players
is assumed to misimplement the action to cooperate in round 30 (thick lines), to defect in round
30 (dotted lines), or to cooperate in round 7 (medium lines).

Figure 3: Discretized adaptive dynamics when $\epsilon = 0.02$. Plotted is $\pi[s', s] - \pi[s, s]$, where
$s = (A_1, h)$ and (a) $s' = (A_1 + 0.2, h)$, (b) $s' = (A_1 - 0.2, h)$, (c) $s' = (A_1, h + 0.02)$, and (d)
$s' = (A_1, h - 0.02)$. (e) Example time courses of a pair of players having $(A_1, h) = (1.7, 0.1)$ and
$(A_1, h) = (2.2, 0.1)$ (solid lines), and $(A_1, h) = (1.7, 0.1)$ and $(A_1, h) = (2.2, 0)$ (dotted lines).
The horizontal line indicates $A_t = P = 2$. We set $\beta = 3$, $p_1 = 0$, and $t_{\max} = 200$.

Figure 4: Discretized adaptive dynamics when $\epsilon = 0.1$. Plotted is $\pi[s', s] - \pi[s, s]$, where
$s = (A_1, h)$ and (a) $s' = (A_1 + 0.2, h)$, (b) $s' = (A_1 - 0.2, h)$, (c) $s' = (A_1, h + 0.02)$, and (d)
$s' = (A_1, h - 0.02)$.

Figure 5: Effects of $t_{\max}$ and $\epsilon$ on the adaptive dynamics when $A_1 \approx P$ and $h = 0$. We
set $\beta = 3$ and $p_1 = 0$. (a) $\pi[(2.1, 0), (1.9, 0)] - \pi[(1.9, 0), (1.9, 0)]$. (b) $\pi[(1.9, 0), (2.1, 0)] -
\pi[(2.1, 0), (2.1, 0)]$.

Figure 6: Evolutionary dynamics in a population composed of learning players. We set
$\beta = 3$, $p_1 = 0$, $\epsilon = 0.02$, $t_{\max} = 200$, $\Delta_{A_1} = 0.05$, and $\Delta_h = 0.01$. (a) Time course of
$h$, $\sum_{t=1}^{t_{\max}-1} |A_{t+1} - A_t|/20$, $\bar{\pi}/R$, and the fraction of mutual cooperation. (b) Time course of
the fraction of initially st1, st2, st3, st4, and st5 players in the run shown in (a). (c) Sample
trajectory of the population averages of $A_1$ and $h$ in the run shown in (a).

Figure 8: (a, b) Sample trajectory of $A_1$ and $h$ under a linear cost of learning. We set (a) $c = 1$
and (b) $c = 10$. (c) Largest $h$ in each of the five runs of $10^6$ generations for various values of $c$.
A cross corresponds to a single run. (d) Average of $h$ over the last $10^4$ generations in each run.
(e) Average of $A_1$ over the last $10^4$ generations in each run.

Table 1: Average payoff to a nonlearning player (row player) playing against an opponent
nonlearning player (column player). We set $\beta = \infty$, $0 < \epsilon \ll 1$, and $t_{\max} = \infty$.